{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. KL Divergence, Fisher Information, and the Natural Gradient\n",
    "\n",
    "## (a) Score function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By the chain rule we have the following expression for the score function\n",
    "$$\\nabla_{\\theta'=\\theta} \\log p(y;\\theta') = \\frac 1{p(y,\\theta)}\\nabla_{\\theta'=\\theta}p(y;\\theta') $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore the expected value of the score is \n",
    "$$\\begin{align*} \n",
    "\\mathbb E_{y\\sim p(y;\\theta)}\\left[\\nabla_{\\theta'=\\theta} \\log p(y;\\theta')\\right] &= \\int_{-\\infty}^\\infty p(y;\\theta)\\nabla_{\\theta'=\\theta} \\log p(y;\\theta') \\mathrm dy\\\\\n",
    "&=\\int_{-\\infty}^\\infty \\nabla_{\\theta'=\\theta} p(y;\\theta') \\mathrm dy\\\\\n",
    "&= \\nabla_{\\theta'=\\theta} \\int_{-\\infty}^\\infty p(y;\\theta') \\mathrm dy\\\\\n",
    "&= \\nabla_{\\theta'=\\theta} 1 = 0.\n",
    "\\end{align*}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (b) Fisher Information\n",
    "\n",
    "Recall that if $\\mu := \\mathbb E_{y\\sim p(y)}[g(y)]$, then\n",
    "$$ \\text{Cov}_{y\\sim p(y)}[g(y)] := \\mathbb E_{y\\sim p(y)}[(g(y)-\\mu)(g(y)-\\mu)^T].$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore by (a) we immediately get \n",
    "$$ \\mathcal I(\\theta)=\\mathbb E_{y\\sim p(y)}\\left[\\nabla_{\\theta'=\\theta} \\log p(y;\\theta')\\nabla_{\\theta'=\\theta} \\log p(y;\\theta')^T\\right].$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (c) Fisher Information (alternate form)\n",
    "\n",
    "The Hessian is the derivative of the gradient. For any (tangent) vector $v$ of the same shape as $\\theta$ we can use the product rule (see my notes to problem set 0 if you need a refresher) to compute:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$\\begin{align*} \n",
    "\\left(\\nabla^2_{\\theta'=\\theta}\\log p(y;\\theta')\\right)v &= \\left(\\mathrm d_{\\theta'=\\theta} \\frac 1{p(y;\\theta')}\\cdot\\nabla_{\\theta'=\\theta} p(y;\\theta')\\right)v \\\\\n",
    "&= \\left( -\\frac 1{p(y;\\theta)^2}\\mathrm d_{\\theta'=\\theta} p(y;\\theta')\\right)v\\cdot \\nabla_{\\theta'=\\theta} p(y;\\theta')\n",
    "+\\frac 1{p(y;\\theta)}\\cdot \\left(\\nabla^2_{\\theta'=\\theta}p(y;\\theta')\\right)v\\\\\n",
    "&= \\nabla_{\\theta'=\\theta} p(y;\\theta') \\cdot \\left( -\\frac 1{p(y;\\theta)^2}\\mathrm d_{\\theta'=\\theta} p(y;\\theta')\\right)v\n",
    "+\\frac 1{p(y;\\theta)}\\cdot \\left(\\nabla^2_{\\theta'=\\theta}p(y;\\theta')\\right)v,\n",
    "\\end{align*}$$\n",
    "where in the last line we used that \n",
    "$$ \\left( -\\frac 1{p(y;\\theta)^2}\\mathrm d_{\\theta'=\\theta} p(y;\\theta')\\right)v \\in \\mathbb R$$\n",
    "is just a scalar."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore the Hessian of the log-likelihood is given by \n",
    "$$\\begin{align*} \n",
    "\\nabla^2_{\\theta'=\\theta}\\log p(y;\\theta')\n",
    "&= \\nabla_{\\theta'=\\theta} p(y;\\theta') \\left( -\\frac 1{p(y;\\theta)^2}\\mathrm d_{\\theta'=\\theta} p(y;\\theta')\\right)\n",
    "+\\frac 1{p(y;\\theta)}\\nabla^2_{\\theta'=\\theta}p(y;\\theta')\\\\\n",
    "&= -\\frac 1{p(y;\\theta)}\\nabla_{\\theta'=\\theta} p(y;\\theta')\\cdot \\frac 1{p(y;\\theta)}\\mathrm d_{\\theta'=\\theta} p(y;\\theta')+\\frac 1{p(y;\\theta)}\\nabla^2_{\\theta'=\\theta}p(y;\\theta')\\\\\n",
    "&=-\\nabla_{\\theta'=\\theta}\\log p(y;\\theta') \\cdot \\nabla_{\\theta'=\\theta}\\log p(y;\\theta')^T + \\frac 1{p(y;\\theta)}\\nabla^2_{\\theta'=\\theta}p(y;\\theta').\n",
    "\\end{align*}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By (b) the expected value of the left summand \n",
    "$$-\\nabla_{\\theta'=\\theta}\\log p(y;\\theta') \\cdot \\nabla_{\\theta'=\\theta}\\log p(y;\\theta')^T $$\n",
    "is $-\\mathcal I(\\theta)$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By essentially the same proof as in (a) (replace $\\nabla$ with $\\nabla^2$), the expected value of the right summand \n",
    "$$ \\frac 1{p(y;\\theta)}\\nabla^2_{\\theta'=\\theta}p(y;\\theta')$$\n",
    "is $0$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Putting everything together we find\n",
    "$$ \\mathbb E_{y\\sim p(y;\\theta)}\\left[-\\nabla^2_{\\theta'=\\theta}\\log p(y;\\theta')\\right] = \\mathcal I(\\theta),$$\n",
    "as claimed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (d) Approximating $D_\\text{KL}$ with Fisher Information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Taylor expansion of the log-likelihood gives\n",
    "$$\\begin{align*} \n",
    "\\log p({y;\\theta+d}) &\\approx \\log p({y;\\theta}) + d^T\\nabla_{\\theta'=\\theta}\\log p({y;\\theta}) + \\frac 1 2 d^T \\left(\\nabla^2_{\\theta'=\\theta}\\log p({y;\\theta} )\\right)d.\n",
    "\\end{align*}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore by (a) and (c)\n",
    "$$\\begin{align*} \n",
    "D_\\text{KL}(p_\\theta \\| p_{\\theta+d}) &= \\mathbb E_{y\\sim p(y;\\theta)}\\left[\\log p({y;\\theta})- \\log p({y;\\theta+d})\\right]\\\\\n",
    "&\\approx \\mathbb E_{y\\sim p(y;\\theta)}\\left[-d^T\\nabla_{\\theta'=\\theta}\\log p({y;\\theta})- \\frac 1 2 d^T \\left(\\nabla^2_{\\theta'=\\theta}\\log p({y;\\theta}) \\right)d\\right]\\\\\n",
    "&= -d^T\\cdot 0 + \\frac 1 2 d^T \\mathcal I(\\theta) d\\\\\n",
    "&=\\frac 1 2 d^T \\mathcal I(\\theta) d\n",
    "\\end{align*}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (e) Natural Gradient\n",
    "\n",
    "After using the Taylor approximations for $\\ell$ and $D_\\text{KL}$, we have to solve the optimization problem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "$$ d^* = \\arg \\max_d d^T\\nabla_{\\theta'=\\theta}\\ell(\\theta') \\quad \\text{subject to}\\quad \\frac 1 2 d^T \\mathcal I(\\theta) d = c.$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Lagrangian for this problem is given by \n",
    "$$ \\mathcal L(d,\\lambda) =d^T\\nabla_{\\theta'=\\theta}\\ell(\\theta') -\\lambda \\left(\\frac 1 2 d^T \\mathcal I(\\theta) d-c\\right).$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This leaves us with the equations (note that $\\mathcal I(\\theta)$ is a symmetric matrix)\n",
    "$$ \\begin{align*}\n",
    "\\nabla_d \\ell(d,\\lambda) &= \\nabla_{\\theta'=\\theta}\\ell(\\theta') -\\lambda \\mathcal I(\\theta)d &=0,\\\\\n",
    "\\nabla_\\lambda \\ell(d,\\lambda) &= - \\frac 1 2 d^T \\mathcal I(\\theta) d+c &=0.\n",
    "\\end{align*}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the first equation we find\n",
    "$$ d = \\frac 1 \\lambda \\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta') =: \\tilde d $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plugging this into the second equation gives (note that $\\mathcal I(\\theta)^{-1}$ is also symmetrical)\n",
    "$$\\begin{align*}\n",
    "2c &= d^T \\mathcal I(\\theta) d\\\\\n",
    "&=\\left(\\frac 1 \\lambda \\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta')\\right)^T \\mathcal I(\\theta) \\frac 1 \\lambda \\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta')\\\\\n",
    "&= \\frac 1 {\\lambda^2}\\left(\\nabla_{\\theta'=\\theta}\\ell(\\theta')\\right)^T  \\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta').\n",
    "\\end{align*}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because $ \\mathcal I(\\theta)^{-1}$ is positive definite, we can take the squareroot to find\n",
    "$$ \\lambda = \\sqrt{\\frac 1 {2c}\\left(\\nabla_{\\theta'=\\theta}\\ell(\\theta')\\right)^T  \\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta')}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plugging this back into the expression for $\\tilde d$, we finally get \n",
    "$$ d^* = \\frac {\\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta')} {\\sqrt{\\frac 1 {2c}\\left(\\nabla_{\\theta'=\\theta}\\ell(\\theta')\\right)^T  \\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta')}}. $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (f) Relation to Newton's Method\n",
    "\n",
    "In a GLM we have a random variable $Y$ that given data $x$ is exponentially distributed with density\n",
    "$$ p(y\\mid x;\\theta) = b(y)\\exp(\\eta T(y)-a(\\eta)).$$\n",
    "with $\\eta = \\theta^T x$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On problem set 1 we saw that the Hessian of $\\ell(\\theta) = p(y\\mid x;\\theta)$ is given by\n",
    "$$ \\nabla^2_{\\theta'=\\theta}\\ell(\\theta) = -\\text{Var}(Y) xx^T. $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Therefore the update step in Newton's method is given by \n",
    "$$ d_\\text{N} = -\\left(\\nabla^2_{\\theta'=\\theta}\\ell(\\theta)\\right)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta') = \\left(\\text{Var}(Y)  x x^T\\right)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta').$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "On the other hand, by (c)\n",
    "$$ \\mathcal I(\\theta) = \\mathbb E_{y\\sim p(y\\mid x;\\theta)}\\left[-\\nabla^2_{\\theta'=\\theta}\\ell(\\theta)\\right] = \\text{Var}(Y)  x x^T.$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hence \n",
    "$$ \\tilde d = \\frac 1 \\lambda \\mathcal I(\\theta)^{-1}\\nabla_{\\theta'=\\theta}\\ell(\\theta') =\\frac 1 \\lambda d_\\text{N},$$\n",
    "which means that the natural gradient has the same direction as the Newton step."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
